This report explores a dataset containing prices and attributes for approximately 114,000 records and 81 variables in this data frame.
## [1] 113937 81
Before start analyzing this data, let’s update the order of the factor levels of the variables ‘CreditGrade’ and create a new variable called ‘ListingCategory’ to store a name of the Listing category
# Create a new variable called ListingCategory.
# We will later update this variable to store a name of the Listing Category
pl$ListingCategory <- pl$ListingCategory..numeric.
# Update the Term and ListingCategory columns to be a factor
pl <- mutate_at(pl, vars(Term, ListingCategory), as.factor)
## --- Update levels
# Update levels of CreditGrade to sorted from Highest to Lowest first
levelsCreditGrade <- c("AA", "A", "B", "C", "D", "E", "HR", "NC", "")
levels(pl$CreditGrade) <- levelsCreditGrade
# Update levels of IncomeRange
levelsIncomeRange <- c("Not displayed", "Not employed",
"$0", "$1-24,999", "$25,000-49,999",
"$50,000-74,999", "$75,000-99,999", "$100,000+")
levels(pl$IncomeRange) <- levelsIncomeRange
# Update levels of ListingCategory
levelsListingCat <- c(
"Not Available", #0
"Debt Consolidation", #1
"Home Improvement", #2
"Business", #3
"Personal Loan", #4
"Student Use", #5
"Auto", #6
"Other", #7
"Baby&Adoption", #8
"Boat", #9
"Cosmetic Procedure", #10
"Engagement Ring", #11
"Green Loans", #12
"Household Expenses", #13
"Large Purchases", #14
"Medical/Dental", #15
"Motorcycle", #16
"RV", #17
"Taxes", #18
"Vacation", #19
"Wedding Loans") #20
levels(pl$ListingCategory) <- levelsListingCat
str(pl)
## 'data.frame': 113937 obs. of 82 variables:
## $ ListingKey : Factor w/ 113066 levels "00003546482094282EF90E5",..: 7180 7193 6647 6669 6686 6689 6699 6706 6687 6687 ...
## $ ListingNumber : int 193129 1209647 81716 658116 909464 1074836 750899 768193 1023355 1023355 ...
## $ ListingCreationDate : Factor w/ 113064 levels "2005-11-09 20:44:28.847000000",..: 14184 111894 6429 64760 85967 100310 72556 74019 97834 97834 ...
## $ CreditGrade : Factor w/ 9 levels "AA","A","B","C",..: 5 1 8 1 1 1 1 1 1 1 ...
## $ Term : Factor w/ 3 levels "12","36","60": 2 2 2 2 2 3 2 2 2 2 ...
## $ LoanStatus : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
## $ ClosedDate : Factor w/ 2803 levels "","2005-11-25 00:00:00",..: 1138 1 1263 1 1 1 1 1 1 1 ...
## $ BorrowerAPR : num 0.165 0.12 0.283 0.125 0.246 ...
## $ BorrowerRate : num 0.158 0.092 0.275 0.0974 0.2085 ...
## $ LenderYield : num 0.138 0.082 0.24 0.0874 0.1985 ...
## $ EstimatedEffectiveYield : num NA 0.0796 NA 0.0849 0.1832 ...
## $ EstimatedLoss : num NA 0.0249 NA 0.0249 0.0925 ...
## $ EstimatedReturn : num NA 0.0547 NA 0.06 0.0907 ...
## $ ProsperRating..numeric. : int NA 6 NA 6 3 5 2 4 7 7 ...
## $ ProsperRating..Alpha. : Factor w/ 8 levels "","A","AA","B",..: 1 2 1 2 6 4 7 5 3 3 ...
## $ ProsperScore : num NA 7 NA 9 4 10 2 4 9 11 ...
## $ ListingCategory..numeric. : int 0 2 0 16 2 1 1 2 7 7 ...
## $ BorrowerState : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
## $ Occupation : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
## $ EmploymentStatus : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
## $ EmploymentStatusDuration : int 2 44 NA 113 44 82 172 103 269 269 ...
## $ IsBorrowerHomeowner : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
## $ CurrentlyInGroup : Factor w/ 2 levels "False","True": 2 1 2 1 1 1 1 1 1 1 ...
## $ GroupKey : Factor w/ 707 levels "","00343376901312423168731",..: 1 1 335 1 1 1 1 1 1 1 ...
## $ DateCreditPulled : Factor w/ 112992 levels "2005-11-09 00:30:04.487000000",..: 14347 111883 6446 64724 85857 100382 72500 73937 97888 97888 ...
## $ CreditScoreRangeLower : int 640 680 480 800 680 740 680 700 820 820 ...
## $ CreditScoreRangeUpper : int 659 699 499 819 699 759 699 719 839 839 ...
## $ FirstRecordedCreditLine : Factor w/ 11586 levels "","1947-08-24 00:00:00",..: 8639 6617 8927 2247 9498 497 8265 7685 5543 5543 ...
## $ CurrentCreditLines : int 5 14 NA 5 19 21 10 6 17 17 ...
## $ OpenCreditLines : int 4 14 NA 5 19 17 7 6 16 16 ...
## $ TotalCreditLinespast7years : int 12 29 3 29 49 49 20 10 32 32 ...
## $ OpenRevolvingAccounts : int 1 13 0 7 6 13 6 5 12 12 ...
## $ OpenRevolvingMonthlyPayment : num 24 389 0 115 220 1410 214 101 219 219 ...
## $ InquiriesLast6Months : int 3 3 0 0 1 0 0 3 1 1 ...
## $ TotalInquiries : num 3 5 1 1 9 2 0 16 6 6 ...
## $ CurrentDelinquencies : int 2 0 1 4 0 0 0 0 0 0 ...
## $ AmountDelinquent : num 472 0 NA 10056 0 ...
## $ DelinquenciesLast7Years : int 4 0 0 14 0 0 0 0 0 0 ...
## $ PublicRecordsLast10Years : int 0 1 0 0 0 0 0 1 0 0 ...
## $ PublicRecordsLast12Months : int 0 0 NA 0 0 0 0 0 0 0 ...
## $ RevolvingCreditBalance : num 0 3989 NA 1444 6193 ...
## $ BankcardUtilization : num 0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
## $ AvailableBankcardCredit : num 1500 10266 NA 30754 695 ...
## $ TotalTrades : num 11 29 NA 26 39 47 16 10 29 29 ...
## $ TradesNeverDelinquent..percentage. : num 0.81 1 NA 0.76 0.95 1 0.68 0.8 1 1 ...
## $ TradesOpenedLast6Months : num 0 2 NA 0 2 0 0 0 1 1 ...
## $ DebtToIncomeRatio : num 0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
## $ IncomeRange : Factor w/ 8 levels "Not displayed",..: 4 5 7 4 3 3 4 4 4 4 ...
## $ IncomeVerifiable : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
## $ StatedMonthlyIncome : num 3083 6125 2083 2875 9583 ...
## $ LoanKey : Factor w/ 113066 levels "00003683605746079487FF7",..: 100337 69837 46303 70776 71387 86505 91250 5425 908 908 ...
## $ TotalProsperLoans : int NA NA NA NA 1 NA NA NA NA NA ...
## $ TotalProsperPaymentsBilled : int NA NA NA NA 11 NA NA NA NA NA ...
## $ OnTimeProsperPayments : int NA NA NA NA 11 NA NA NA NA NA ...
## $ ProsperPaymentsLessThanOneMonthLate: int NA NA NA NA 0 NA NA NA NA NA ...
## $ ProsperPaymentsOneMonthPlusLate : int NA NA NA NA 0 NA NA NA NA NA ...
## $ ProsperPrincipalBorrowed : num NA NA NA NA 11000 NA NA NA NA NA ...
## $ ProsperPrincipalOutstanding : num NA NA NA NA 9948 ...
## $ ScorexChangeAtTimeOfListing : int NA NA NA NA NA NA NA NA NA NA ...
## $ LoanCurrentDaysDelinquent : int 0 0 0 0 0 0 0 0 0 0 ...
## $ LoanFirstDefaultedCycleNumber : int NA NA NA NA NA NA NA NA NA NA ...
## $ LoanMonthsSinceOrigination : int 78 0 86 16 6 3 11 10 3 3 ...
## $ LoanNumber : int 19141 134815 6466 77296 102670 123257 88353 90051 121268 121268 ...
## $ LoanOriginalAmount : int 9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
## $ LoanOriginationDate : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 1866 260 1535 1757 1821 1649 1666 1813 1813 ...
## $ LoanOriginationQuarter : Factor w/ 33 levels "Q1 2006","Q1 2007",..: 18 8 2 32 24 33 16 16 33 33 ...
## $ MemberKey : Factor w/ 90831 levels "00003397697413387CAF966",..: 11071 10302 33781 54939 19465 48037 60448 40951 26129 26129 ...
## $ MonthlyLoanPayment : num 330 319 123 321 564 ...
## $ LP_CustomerPayments : num 11396 0 4187 5143 2820 ...
## $ LP_CustomerPrincipalPayments : num 9425 0 3001 4091 1563 ...
## $ LP_InterestandFees : num 1971 0 1186 1052 1257 ...
## $ LP_ServiceFees : num -133.2 0 -24.2 -108 -60.3 ...
## $ LP_CollectionFees : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_GrossPrincipalLoss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_NetPrincipalLoss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_NonPrincipalRecoverypayments : num 0 0 0 0 0 0 0 0 0 0 ...
## $ PercentFunded : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Recommendations : int 0 0 0 0 0 0 0 0 0 0 ...
## $ InvestmentFromFriendsCount : int 0 0 0 0 0 0 0 0 0 0 ...
## $ InvestmentFromFriendsAmount : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Investors : int 258 1 41 158 20 1 1 1 1 1 ...
## $ ListingCategory : Factor w/ 21 levels "Not Available",..: 1 3 1 17 3 2 2 3 8 8 ...
CreditGrade
##
## AA A B C D E HR NC
## 84984 3315 3509 4389 5649 5153 3289 3508 141
ListingCreatingDate
Let’s take a closely look at a count of listings of each CreditGrade.
It is very interesting that:
From the discovery above, I assume that we only need to focus on the CreditGrade ‘AA’ from 2010 for this analysis since a number of count before 2010 is closed to 0.
See this link for more details about the propser loan
https://www.lendingmemo.com/prosper-loan-complaints/
New data set
From this point on, we will analyze data where the ListingCreationDate is greater than or equal to 2010 and there are almost 83,000 records in the new data set
# Get a subset of data which has the ListingCreationData >= 2010
new_pl <- subset(pl, year(ListingCreationDate) >= 2010)
dim(new_pl)
## [1] 82675 82
BorrowerRate
In average, the BorrowRate is around 0.15 except a very high number of applications with the Borrow rate around 3.25. Why is that the case?
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0400 0.1364 0.1875 0.1961 0.2573 0.3600
Occupation
Let’s zoom in data by occupation. The charts below show that most applications have the Occupations set to ‘Other’ and the LoanOriginalAmount of this category is varied a lot comparing to other occupations.
EmploymentStatus
Most applications have the EmploymentStatus ‘Employed’ and the median LoanOriginalAmount of this category is higher than others.
BorrowerState
Now, let’s take a look at a number of Listings by each state. Looks like major cities have a higher number of applications.
From the summary in terms of a number of applications in each state, ‘Wyoming’ (WY) has the lowest number of applications which is 119 and ‘California’ (CA) has the highest number of applications which is 10,479.
The median number of applications is 1161.5 and average number is 1722.4.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 119.0 427.2 1161.5 1722.4 2229.5 10479.0
The boxplot below shows that there are four states where a number of applications far exceeds the majority (outliers).
When combining information from the bar chart above and the boxplot below we can identify that those four states are California (CA), New York (NY), Texas (TX), and Florida (FL). Also notice that a number of applications in CA is also far more than the other three states.
The maps below make it clear that a number of listings is higher in the major cities.
Term
The histogram below shows that most applications have the 36-Term.
IncomeRange
Majority of the applications have the IncomeRange between $1-$50,000. There are also applications with IncomeRange = 0. Did they get a loan?
DebtToIncomeRatio
From the histogram and boxplot, most of the applications have the DebtToIncomeRatio less than 1.
However, there are also outliers where the DebtToIncomeRatio > 1 and up to DebtToIncomeRatio = 10.01. What is a status of those applications?
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1500 0.2200 0.2596 0.3200 10.0100
The frequency polygon below shows that most applications have the DebtToIncomeRatio around 0.2.
ListingCategory
Most applications apply the Proper loan for ‘Debt Consolidation’ with the median LoanOriginalAmout around $10,000
CreditScoreRange
The median of the CreditScoreRangeLower is 700 and the median of the CreditScoreRangeUpper is 719.
## [1] "*** Summary of CreditScoreRangeLower ***"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 600.0 660.0 700.0 699.2 720.0 880.0
## [1] "*** Summary of CreditScoreRangeUpper ***"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 619.0 679.0 719.0 718.2 739.0 899.0
There are approximately 114,000 records and 81 variables. After analyzing the ListingCreatingDate variable, We found that data with various CreditGrade were not created after 2009, and only data with the CreditGrade ‘AA’ was created after 2010. Hence, I only filtered data with the ListingCreatingDate is greater than or equal to 2010 for this exploration.
With the new dataset, there are approximately 83,000 records with 82 variables where an additional factor variable ‘ListingCategory’ was created to store a ListingCategory name of each record
I am paricularly interested in finding variables that are best for describing relationship with the BorrowerRate variable.
I think the following variables will correlate to the BorrowerRate; Occupation, EmploymentStatus, BorrowerState, Term, IncomeRange, DebtToIncomeRatio, ListingCategory, CreditScoreRangeLower, and CreditScoreRangeUpper.
As mentioned in the previous section, I created a new variable ‘ListingCategory’ to store a text of each category to accomodate this analysis.
I found an unsual trend in the charts that show a number of listings based on the ListingCreatingDate and found that there are no records with various CreditGrade after 2010, and only records with the CreditGrade ‘AA’ have the ListingCreatingDate greater than or equal to 2010 so I decided to only analyze records where the ListingCreatingDate >= 2010.
Also, I updated levels of the variables ‘CreditGrade’ and ‘IncomeRange’ to ease a process of creating visualizations for those variables so that an appropriate level’s order is shown in the plots.
Scatterplot Matrix
Let’s start off by creating a scatterplot matric with the variables below:
BorrowerRate, LoanOriginalAmount, StatedMonthlyIncome, DebtToIncomeRatio, CreditScoreRangeLower, CreditScoreRangeUpper, Term, EmploymentStatus
Note that we did not choose the ‘Occupation’, ‘BorrowerState’ and ‘ListingCategory’ in this plot because they have a lot of levels
The correlation matrix shows that the ‘LoanOriginalAmount’, ‘CreditScoreRangeLower’, and ‘CreditScoreRangeUpper’ variables have a strong correlation with the ‘BorrowerRate’ variable
BorrowerRate vs LoanOriginalAmount
Let’s take a look at a scatter plot of BorrowerRate (y) vs LoanOriginalAmount(x)
The correlation coefficient between BorrowerRate and LoanOriginalAmount is around -0.42 which shows a moderate correlation between those two variables
##
## Pearson's product-moment correlation
##
## data: LoanOriginalAmount and BorrowerRate
## t = -134.57, df = 82673, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4294568 -0.4182732
## sample estimates:
## cor
## -0.4238811
BorrowerRate vs CreditScoreRange
Let’s take a look at a scatter plot of BorrowerRate (y) vs CreditScoreRangeLower(x) and a scatter plot of BorrowerRate (y) vs CreditScoreRangeUpper(x)
The smoothed conditional mean line shows that the BorrowerRate is decreased when the CreditScoreRangeLower or CreditScoreRangeUpper is increased
The correlation coefficient between BorrowerRate and CreditScoreRangeLower is around -0.5 which shows a moderate correlation between those two variables
##
## Pearson's product-moment correlation
##
## data: CreditScoreRangeLower and BorrowerRate
## t = -166.77, df = 82673, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.5068121 -0.4966108
## sample estimates:
## cor
## -0.5017289
The correlation coefficient between BorrowerRate CreditScoreRangeUpper is around -0.5 which shows a moderate correlation between those two variables
##
## Pearson's product-moment correlation
##
## data: CreditScoreRangeUpper and BorrowerRate
## t = -166.77, df = 82673, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.5068121 -0.4966108
## sample estimates:
## cor
## -0.5017289
BorrowerRate vs BorrowerState
The boxplot below confirms that the BorrowerState does not have impact on the BorrowerRate since the median value of the BorrowerRate of all states is around 0.2 and the boxplots are closely aligned.
BorrowerRate vs Occupation
The boxplot below also shows that Occuputation does not correlate to the BorrowerRate since the median BorrowerRate is around 0.2 excepts the Occupations with a small number of applications that shows lower BorrowerRate, (for example, Judge)
BorrowerRate vs ListingCategory
The boxplot below also shows that ListingCategory does not correlate to the BorrowerRate since the median BorrowerRate is around 0.2.
BorrowerRate vs StatedMonthlyIncome
Let’s take a look at a scatter plot of BorrowerRate (y) vs StatedMonthlyIncome(x).
This scatter plot shows multiple outliers that is difficult to see a relation between the BorrowerRate and StatedMonthlyIncome.
So what is a summary of StatedMonthlyIncome?
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 3467 5000 5953 7100 1750003
Let’s create a scatter plot of BorrowerRate (y) vs StatedMonthlyIncome(x) where StatedMonthlyIncome is in the 95% percentile.
It looks like the BorrowerRate is slightly decreased when the StatedMonthlyIncome is increased.
The correlation coefficient between BorrowerRate and StatedMonthlyIncome is around -0.25 which shows a weak correlation between those two variables
##
## Pearson's product-moment correlation
##
## data: StatedMonthlyIncome and BorrowerRate
## t = -71.343, df = 78796, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2528729 -0.2397558
## sample estimates:
## cor
## -0.2463256
BorrowerRate vs DebtToIncomeRatio Let’s take a look at a scatter plot of BorrowerRate (y) vs DebtToIncomeRatio(x).
This scatter plot shows multiple outliers that is difficult to see a relation between the BorrowerRate and DebtToIncomeRatio
What is a summary of DebtToIncomeRatio?
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 0.15 0.22 0.26 0.32 10.01 7084
Let’s create a scatter plot of BorrowerRate (y) vs DebtToIncomeRatio(x) where DebtToIncomeRatio is in the 99% percentile.
It looks like the BorrowerRate is increased when the DebtToIncomeRatio is increased.
The correlation coefficient between BorrowerRate and DebtToIncomeRatio is around 0.18 which shows a weak correlation between those two variables
##
## Pearson's product-moment correlation
##
## data: DebtToIncomeRatio and BorrowerRate
## t = 49.063, df = 74848, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1695706 0.1834521
## sample estimates:
## cor
## 0.1765201
LoanOriginalAmount vs StatedMonthlyIncome
Let’s take a look at the relationship between LoanOriginalAmount and StatedMonthlyIncome.
The scatter plot of LoanOriginalAmount(y) and 95 percentile of StatedMonthlyIncome (x) shows that the LoanOriginalAmount is increased when the StatedMonthlyIndome is increased.
The correlation coefficient between LoanOriginalAmount and 95 percentile of StatedMonthlyIncome is 0.41 which means that these two variables have a moderate correlation.
##
## Pearson's product-moment correlation
##
## data: StatedMonthlyIncome and LoanOriginalAmount
## t = 127.01, df = 78796, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4064179 0.4180093
## sample estimates:
## cor
## 0.4122303
There are a few features, which are LoanOriginalAmount, CreditScoreRangeLower, CreditScoreRangeLower, that have a moderate correlation with the BorrowerRate.
Additionally, there are a few, which are StatedMonthlyIncome and DebtToIncomeRatio, that shows a weak correlation with the BorrowerRate after removing the outlier.
However, other features, such as BorrowerState, Occupation, ListingCategory, do not seem to have any strong relationship with the BorrowerRate
The StatedMonthlyIncome and LoanOriginalAmount variables have a moderate correlation where the LoanOriginalAmount is increased when the StatedMonthlyIncome is increased
The CreditScoreRangeLower/Update have the strongest relationship with the BorrowerRate where their correlation coefficient is around -0.5
BorrowerRate, LoanOriginalAmount | Term
Let’s take a look at a scatter plot of BorrowerRate (y) vs LoanOriginalAmount(x) and use a different color for each Term.
It looks like most smaller LoanOriginalAmout has 36 Term but it is hard to tell because of overplotting. Let’s use a facet_wrap instead.
The plots below show that most of the applications have lower LoanOriginalAmount for all Terms. Also, it looks like the LoanOriginalAmount is usually about x$5000 (darker points) as we see in the plots
The conditional mean chart for different Term shows that the BorrowerRate of the 12-Term is the lowest regardless of the LoanOriginalAmount, follow by the 36-Term and 60-Term.
BorrowerRate, LoanOriginalAmount | Occupation, BorrowerState, and ListingCategory
The facet plots below show that Occupation, BorrowerState, and ListingCategory does not have a strong relationship with the BorrowRate and LoanOriginal amount since all conditional mean plots show a similar Trend.
Additionally, most applications with the LoanOriginalAmout less than $10,000 have the 36-Term with various BorrowerState regardless of Occupation, BorrowerState, and ListingCategory.